Contact Zones R Code Tutorial

Author

Den Jackson

The following is a tutorial explaining the R code used for my Contact Zone analysis.

Code
library(tidyverse)
# Load the data in. You may need to change file locations if running yourself
games <- read.csv("games.csv") # Has home/visiting teams and scores
players <- read.csv("players.csv") # Has player position (not coords)
plays <- read.csv("plays.csv") # Has ballCarrierId
tackles <- read.csv("tackles.csv") # Has official recorded tacklers
week1 <- read.csv("tracking_week_1.csv") # Player tracking data

Initial Data Filtering

The first step in our process is to trim the data down and make sure we are only grabbing plays that are relevant for our analysis. To do this, I coded a function that takes any NFL Next Gen Stats weekly tracking dataset and filters it for us. This function will work for any individual week or any combination of weeks. For our purposes, I will only be running it with the first week of data. This function also adds “doi_ind,” a variable that indiciates if the defender is a defender of interest (see my paper for more info).

Code
# Function that narrows down the data to only grab plays where "tackle" is 
# recorded as an event. Also adds doi_ind
getPlays <- function(week_data) {
  # Get every frame with tackle or qb_sack recorded
  tackleFrames <- week_data[week_data$event %in% c("tackle", "qb_sack"),]
  # Get unique game and play codes for each of these
  gamePlayCodes <- unique(tackleFrames[1:2])
  # Combine the two codes to make a single indicator
  tackleGamePlayInds <- paste(gamePlayCodes$gameId, gamePlayCodes$playId, 
                              sep = "-")
  
  # Now create this same indicator but for every row in the weekly data
  week_data$gamePlayId <- paste(week_data$gameId, week_data$playId, sep = "-")
  
  # Now only grab the plays in the weekly data that have matching gamePlayInds
  filteredPlays <- week_data[week_data$gamePlayId %in% tackleGamePlayInds,]
  
  # Give the football the label "B" for ball
  # filteredPlays[filteredPlays$displayName == "football", "jerseyNumber"] <- "B"
  
  # Get all the defenders of interest (DBs and LBs, everyone but the d-line)
  defOfInterest <- players[players$position %in% c("CB", "DB", "FS", "ILB", "MLB",
                                                   "OLB", "SS"), c(1, 6, 7)]
  # Add indicator for if player is a defender of interest or not
  filteredPlays$doi_ind <- ifelse(filteredPlays$nflId%in%defOfInterest$nflId,1,0)
  
  return(filteredPlays)
}

# Used for plotting later
week1Filtered <- getPlays(week1)

At this point we also save the data, as it will come in handy for plotting plays later.

Evaluating Contact Zones

evalContactZones()

In my code, the entire next step all falls under one function evalCOntactZones(). For ease of understanding the parts, I will look at it in separate chunks here, but keep in mind that all the code here falls under a function and is not meant to be run independently.

The first thing we do is some extra filtering. We create our unique gamePlayId for some of the other datasets and narrow down our data to keep only what is most relevant. We also grab some information that will be useful later; namely, the coordinates of the ball-carrier and the frame that the event “tackle” was recorded on for every play.

Code
# Create the id column for play_df and tackle_df
play_df$gamePlayId <- paste(play_df$gameId, play_df$playId, sep = "-")
tackle_df$gamePlayId <- paste(tackle_df$gameId, tackle_df$playId, sep = "-")

# Filter down the data to only contain the relevant columns
weekDf <- filteredPlays[,c(18, 3:5, 10, 11, 17, 19)]

# This adds ballCarrierId to our week data 
weekDf <- left_join(weekDf, play_df[,c(36, 3)], by = "gamePlayId")

# For every gamePlayId, grab the ball-carrier's coordinates at every frame
bcCoords <- weekDf %>% 
  filter(nflId == ballCarrierId) %>% 
  select(gamePlayId, frameId, x, y)

# Get the frame the tackle was recorded to have occurred on for each play
tackleFrames <- weekDf %>% 
  filter(event == "tackle") %>% 
  group_by(gamePlayId) %>% 
  slice(1) %>% # We only need the first row to get the correct frameId
  select(gamePlayId, frameId)

Next we initialize our results list which will be returned as a dataframe at the end of the function. We add the user-specified value for radius on so that when we iterate over this function, we know what zone size we are testing. We also create an important function called defInfo that also gets returned at the end of the function. This dataframe has very important information on all our defenders of interest for every play they show up in; this is where we get distances to the ball-carrier and an indicator that lets us know if the ball-carrier has entered a defender’s zone.

Code
# Initialize a results dataframe to be returned at the end
results <- list()
results$radius <- radius

# Now create a week df that only has the DOIs and add the ball carrier coords
# as well as an indicator if the BC is in a DOI's zone
defInfo <- weekDf %>% 
  filter(doi_ind == 1) %>% 
  left_join(bcCoords, by = c("gamePlayId", "frameId"), 
            suffix = c(".doi", ".bc")) %>% 
  group_by(frameId) %>% 
  mutate(distToBC = sqrt((x.doi - x.bc)^2 + (y.doi - y.bc)^2),
         bcInZone = ifelse(distToBC < radius, 1, 0))

We have to do some initial filtering here because of an issue that arises with pass plays. Sometimes the ball-carrier is already in a zone before they even have the ball, so it looks like they are already being tackled even before the ball gets to them. We want to basically deactivate the bcInZone indicator until the ball-carrier actually comes into possession of the ball, which is what the code below does.

Code
ballArrivals <- weekDf %>% 
  filter(event %in% c("pass_arrived", "pass_outcome_caught")) %>%
  group_by(gamePlayId) %>%
  summarize(ballArrivalFrame = min(frameId))
  
  defInfo <- defInfo %>% 
  left_join(ballArrivals, by = "gamePlayId") %>%
  # If pass play, turn all bcInZone values before the pass arrives to 0
  mutate(bcInZone = ifelse(!is.na(ballArrivalFrame) & 
                             frameId < ballArrivalFrame, 0, bcInZone))

Next we do some important feature creation that will help us create the statistics we will later use to evaluate our zone sizes. The most important things we create here are three variables for each play: - numZonesEntered: How many unique contact zones did a ball-carrier enter throughout the course of a play - tacklerIds: the IDs of the players those zones belong to - firstEntryDiff: the difference between the frames where “first_contact” occurs and where the ball-carrier first enters a zone of contact. We then add the mean and median of firstEntryDiff to the results dataframe.

Code
# Add the frame first_contact first appears, used to compare with first zone
# entry
fcFirstFrames <- defInfo %>% 
  filter(event == "first_contact") %>% 
  group_by(gamePlayId) %>% 
  summarize(fcFrame = min(frameId))

# Count how many zones the BC enters in each play
zoneEntries <- defInfo %>%
  left_join(tackleFrames, by = "gamePlayId") %>% 
  left_join(fcFirstFrames, by = "gamePlayId") %>% 
  # Only grab times where the BC was in the zone BEFORE the event "tackle"
  filter(bcInZone == 1, frameId.x < frameId.y) %>% 
  group_by(gamePlayId) %>%
  summarize(numZonesEntered = length(unique(nflId)),
            tacklerIds = list(unique(nflId)),
            firstEntryDiff = min(fcFrame) - min(frameId.x)) %>% 
  filter(gamePlayId != "2022091102-1517") # No tackle recorded for this play

results$meanEntryDiff <- mean(zoneEntries$firstEntryDiff, na.rm = T)
results$medianEntryDiff <- median(zoneEntries$firstEntryDiff, na.rm = T)

Next we create the last 3 metrics: correctPct, noneMissingPct, and extraPct. We get a dataframe that consists of vectors with the officially credited tacklers’ IDs and compare those to the vector of tacklerIds we created in the previous code chunk. We then use these results to calculate the remaining metrics, which are added to results.

Code
# Now we can compare these numbers to the numbers of DOIs credited on a tackle
filteredTackles <- tackle_df[tackle_df$gamePlayId%in%zoneEntries$gamePlayId,]

officialTacklers <- filteredTackles %>% 
  group_by(gamePlayId) %>% 
  summarize(numTacklers = length(unique(nflId)),
            officialTacklerIds = list(unique(nflId)))

compareIdsDf <- zoneEntries %>% 
  left_join(officialTacklers, by = "gamePlayId") %>% 
  mutate(
    # Boolean column that is 1 if all predicted are accurate, 0 if not
    correct = map2_int(tacklerIds, officialTacklerIds, ~ setequal(.x, .y)),
    # Count of players we missed (get rid of length for actual IDs)
    missingIds = map2_int(officialTacklerIds, tacklerIds, 
                          ~ length(setdiff(.x, .y))),
    # Count of players we accidentally counted (remove length for actual IDs)
    extraIds = map2_int(tacklerIds, officialTacklerIds, 
                        ~ length(setdiff(.x, .y)))
  )

# What % do we match up exactly?
results$correctPct <- mean(as.integer(compareIdsDf$correct)) # 39.38% for smaller radius

# How many are we missing?
results$noneMissingPct <- mean(compareIdsDf$missingIds == 0) # 73% where we aren't missing any

# How many are we adding?
# Complement would be the % all of our predictions were right (but could still
# be missing players)
results$extraPct <- mean(compareIdsDf$extraIds != 0) # Adding at least one extra ~68% of the time

The full function is in the chunk below, along with extra commented lines of code I had run and thought were useful but did not end up keeping in my analysis:

Code
# Function that evaluates zones of contact for a given radius
evalContactZones <- function(week_df, radius, play_df = plays, 
                             tackle_df = tackles){

  filteredPlays <- getPlays(week_df)
  
  # Create the id column for play_df and tackle_df
  play_df$gamePlayId <- paste(play_df$gameId, play_df$playId, sep = "-")
  tackle_df$gamePlayId <- paste(tackle_df$gameId, tackle_df$playId, sep = "-")
  
  # Filter down the data to only contain the relevant columns
  weekDf <- filteredPlays[,c(18, 3:5, 10, 11, 17, 19)]
  
  # This adds ballCarrierId to our week data 
  weekDf <- left_join(weekDf, play_df[,c(36, 3)], by = "gamePlayId")
  
  # For every gamePlayId, grab the ball-carrier's coordinates at every frame
  bcCoords <- weekDf %>% 
    filter(nflId == ballCarrierId) %>% 
    select(gamePlayId, frameId, x, y)
  
  # Get the frame the tackle was recorded to have occurred on for each play
  tackleFrames <- weekDf %>% 
    filter(event == "tackle") %>% 
    group_by(gamePlayId) %>% 
    slice(1) %>% # We only need the first row to get the correct frameId
    select(gamePlayId, frameId)
  
  # Initialize a results dataframe to be returned at the end
  results <- list()
  results$radius <- radius
  
  # Now create a week df that only has the DOIs and add the ball carrier coords
  # as well as an indicator if the BC is in a DOI's zone
  defInfo <- weekDf %>% 
    filter(doi_ind == 1) %>% 
    left_join(bcCoords, by = c("gamePlayId", "frameId"), 
              suffix = c(".doi", ".bc")) %>% 
    group_by(frameId) %>% 
    mutate(distToBC = sqrt((x.doi - x.bc)^2 + (y.doi - y.bc)^2),
           bcInZone = ifelse(distToBC < radius, 1, 0))
  
  # One issue that arises is when we have pass plays, sometimes the BC is 
  # already in a zone before they even have the ball, so it acts like there are
  # being tackled even before the ball gets to them. It may be more accurate to
  # only count the zone as being active after the ball arrives, as that is when
  # the tackle can legally begin. ALSO, BIG POINT: the BC may just be running a
  # route and passing through zones before they actually get the ball, and those
  # are being counted. We only want to count once they are officially a BC.
  ballArrivals <- weekDf %>% 
    filter(event %in% c("pass_arrived", "pass_outcome_caught")) %>%
    group_by(gamePlayId) %>%
    summarize(ballArrivalFrame = min(frameId))
  
  defInfo <- defInfo %>% 
    left_join(ballArrivals, by = "gamePlayId") %>%
    # If pass play, turn all bcInZone values before the pass arrives to 0
    mutate(bcInZone = ifelse(!is.na(ballArrivalFrame) & 
                               frameId < ballArrivalFrame, 0, bcInZone))
  
  # Add the frame first_contact first appears, used to compare with first zone
  # entry
  fcFirstFrames <- defInfo %>% 
    filter(event == "first_contact") %>% 
    group_by(gamePlayId) %>% 
    summarize(fcFrame = min(frameId))
  
  # Count how many zones the BC enters in each play
  zoneEntries <- defInfo %>%
    left_join(tackleFrames, by = "gamePlayId") %>% 
    left_join(fcFirstFrames, by = "gamePlayId") %>% 
    # Only grab times where the BC was in the zone BEFORE the event "tackle"
    filter(bcInZone == 1, frameId.x < frameId.y) %>% 
    group_by(gamePlayId) %>%
    summarize(numZonesEntered = length(unique(nflId)),
              tacklerIds = list(unique(nflId)),
              firstEntryDiff = min(fcFrame) - min(frameId.x)) %>% 
    filter(gamePlayId != "2022091102-1517") # No tackle recorded for this play
  
  # summary(zoneEntries$numZonesEntered)
  # summary(zoneEntries$firstEntryDiff)
  # hist(na.omit(zoneEntries$firstEntryDiff))
  # boxplot(na.omit(zoneEntries$firstEntryDiff), horizontal = T)
  results$meanEntryDiff <- mean(zoneEntries$firstEntryDiff, na.rm = T)
  results$medianEntryDiff <- median(zoneEntries$firstEntryDiff, na.rm = T)
  
  # Now we can compare these numbers to the numbers of DOIs credited on a tackle
  filteredTackles <- tackle_df[tackle_df$gamePlayId%in%zoneEntries$gamePlayId,]
  
  officialTacklers <- filteredTackles %>% 
    group_by(gamePlayId) %>% 
    summarize(numTacklers = length(unique(nflId)),
              officialTacklerIds = list(unique(nflId)))
  #summary(officialTacklers$numTacklers)
  
  # Merge these two results so that we can directly compare better
  #compareDf <- merge(zoneEntries, officialTacklers, by = "gamePlayId")
  # zoneEntries <- zoneEntries %>% 
  #   mutate(Category = 'Zones Entered', Value = numZonesEntered) %>% 
  #   select(Category, Value)
  # tacklePlayerCounts <- officialTacklers %>% 
  #   mutate(Category = 'Actual Tackler Count', Value = numTacklers) %>% 
  #   select(Category, Value)
  # compareDf <- bind_rows(zoneEntries, officialTacklers)
  # 
  # ggplot(compareDf, aes(x = as.factor(Value), fill = Category)) +
  #   geom_bar(position = "dodge") +
  #   labs(x = element_blank(), 
  #     title = "Actual Credited Tackler Count vs. Claimed Count",
  #     subtitle = "Filtering out where BC entered zone after tackle event was recorded") +
  #   theme_minimal() 
  
  compareIdsDf <- zoneEntries %>% 
    left_join(officialTacklers, by = "gamePlayId") %>% 
    mutate(
      # Boolean column that is 1 if all predicted are accurate, 0 if not
      correct = map2_int(tacklerIds, officialTacklerIds, ~ setequal(.x, .y)),
      # Count of players we missed (get rid of length for actual IDs)
      missingIds = map2_int(officialTacklerIds, tacklerIds, 
                            ~ length(setdiff(.x, .y))),
      # Count of players we accidentally counted (remove length for actual IDs)
      extraIds = map2_int(tacklerIds, officialTacklerIds, 
                          ~ length(setdiff(.x, .y)))
    )
  
  # mean(compareIdsDf$numZonesEntered > compareIdsDf$numTacklers)
  # mean(compareIdsDf$numZonesEntered < compareIdsDf$numTacklers)
  # mean(compareIdsDf$numZonesEntered == compareIdsDf$numTacklers)
  
  # What % do we match up exactly?
  results$correctPct <- mean(as.integer(compareIdsDf$correct)) # 39.38% for smaller radius
  
  # How many are we missing?
  #summary(compareIdsDf$missingIds)
  #barplot(table(compareIdsDf$missingIds))
  results$noneMissingPct <- mean(compareIdsDf$missingIds == 0) # 73% where we aren't missing any
  
  # How many are we adding?
  # Complement would be the % all of our predictions were right (but could still
  # be missing players)
  results$extraPct <- mean(compareIdsDf$extraIds != 0) # Adding at least one extra ~68% of the time
  
  # Returning defInfo for plotting purposes
  return(list(results = data.frame(results), defInfo = defInfo))
}

Results

The following code loops through a vector of potential radii candidates and computes the 5 proposed metrics for each of them.

Code
radii <- seq(0.9, 2.2, by = 0.05)

# Initialize results df
results <- data.frame(radius = numeric(), meanEntryDiff = numeric(), 
                      medianEntryDiff = numeric(), correctPct = numeric(), 
                      noneMissingPct = numeric(), extraPct = numeric())

for (rad in radii) {
  # Get results for current radius and then rbind them to results df
  tempRes <- evalContactZones(week1, radius = rad)$results
  results <- rbind(results, tempRes)
}

We can then make plots for each of the statistics (only the first two are shown in the paper:

Code
# Plot mean and median entry diffs (mean and median number of frames between the
# BC first entering the zone and the first_contact label)
plot(meanEntryDiff ~ radius, results, col = "red", type = "b", pch = 16, lty =2,
     xlab = "Radius", ylab = "Frame Difference", 
     main = "Mean and Median Entry Frame Differences")
lines(medianEntryDiff ~ radius, results, col = "black", type = "b", pch = 16)
legend("bottomright", c("Mean", "Median"), col = c("red", "black"), 
       lty = c(2, 1), pch = 16)
abline(v = 1.2, col = "blue", lty = 3)

Code
# Plot noneMissingPct across radii (% of games where we didn't miss any of the
# actual tacklers)
plot(noneMissingPct ~ radius, results, type = "b", pch = 16,
     xlab = "Radius", ylab = "Proportion of No Missing Players", 
     main = "None Missing % For Different Radii")
abline(v = results$radius[which.max(results$noneMissingPct)], 
       col = "red", lty = 2)

Code
# Plot correctPct across radii (% of games where our players matched exactly
# with the real tacklers)
plot(correctPct ~ radius, results, type = "b", pch = 16,
     xlab = "Radius", ylab = "Proportion of All Correct Players", 
     main = "All Correct % For Different Radii")

Code
# Plot extraPct across radii (% of games where we added an extra player to the
# tackle)
plot(extraPct ~ radius, results, type = "b", pch = 16,
     xlab = "Radius", ylab = "Proportion of Extra Players Added", 
     main = "Extra Player % For Different Radii")

Visualizing Plays

The following function lets you plot any game in our filtered week data, either as an animation or a still image. The still image often gets messy as it plots every frame at once, so I recommend making an animation (which is also way cooler). The full function is in the chunk below:

Code
library(ggforce)
library(gganimate)
plotGame <- function(gamePlayId, radius, week_df, animate = T) {
  # week_df is filtered week data from getPlays()
  
  # Get the data for every player (not just DOIs) from the week data
  play <- week_df[week_df$gamePlayId == gamePlayId,]
  #play <- week_df[(week_df$gamePlayId == gamePlayId) & (week_df$frameId == 1),]

  # Get defInfo for plotting the actual zones
  defInfo <- evalContactZones(play, radius)$defInfo
  defInfo <- defInfo[defInfo$gamePlayId == gamePlayId,]
  #defInfo <- defInfo[(defInfo$gamePlayId == gamePlayId)&(defInfo$frameId==1),]
  
  p <- ggplot() +
    geom_circle(data = defInfo, aes(x0 = x.doi, y0 = y.doi, r = radius, 
                                    fill = as.factor(bcInZone), group = nflId), 
                col = NA, alpha = 1) +
    scale_fill_manual(values = c("0" = "gray", "1" = "green"), guide = "none")+
    # Everybody but the football gets a label
    geom_text(data = subset(play, club != "football"),
              aes(x = x, y = y, col = club, label = jerseyNumber),
              fontface = "bold") +
    scale_color_manual(values = c("red", "blue")) +
    # Add a brown dot to act as the football
    geom_point(data = subset(play, club == "football"), 
               aes(x = x, y = y), fill = "darkorange4", shape = 21, 
               size = 3) +
    coord_fixed() +
    theme_minimal() +
    theme(panel.background = element_rect(fill = "darkseagreen3")) +
    guides(col = guide_legend(title = "Team")) +
    labs(title = paste0("Game ", gamePlayId))
  if (animate) {
    p <- p + transition_time(frameId) +
      labs(title = paste0("Game ", gamePlayId),
           subtitle = "Frame: {frame_time}")
    return(animate(p, nframes = length(play$frameId) / 23, fps = 10))
  } else {
    return(p)
  }
}

Here is the game I plotted in the paper, but now you can see the full animation:

Code
# I used this to sample random games to watch to visually test zones
#sample(unique(defInfo$gamePlayId), 1)

#Actual game: (play starts at 41:52)
#https://nfl-video.com/new-orleans-saints-vs-atlanta-falcons-full-game-replay-2022-nfl-week-1

# Use the week1Filtered data from the getPlays() function
plotGame("2022091100-1260", 1.2, week1Filtered, animate = T)
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `ballArrivalFrame = min(frameId)`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf

Here is a clip of the actual play itself:

To get the first figure shown in the data (the single frame of a play), I used a specialized version of this function where I isolated just the first frame. You can still see the commented out chunks containing those, if you uncommented them and set animate = F you would get just the first frame of data.